Improving Classification of Multi-Lingual Web Documents using Domain Ontologies
Abstract
In this paper, we address the problem of analyzing and classifying web documents into several major categories (classes) in a given domain using a domain ontology. We present an ontology-based web content mining methodology whose main stages are collecting a training set of labeled documents from a given domain, building a classification model for this domain using the domain ontology, and classifying new documents with the induced model. We tested the proposed methodology in a specific domain, namely web pages containing information about the production of certain chemicals. Our goal is to identify all relevant web documents while ignoring documents that do not contain any relevant information. Our system receives as input an OWL file built with the Protégé tool, which contains the domain-specific ontology, and a set of web documents classified by a human expert as "relevant" or "non-relevant". We use a language-independent key-phrase extractor with an integrated ontology parser (defined in a given language) to create a database from the input documents, which serves as the training set for the classification algorithm. We evaluate the classification accuracy of the system using various levels of the ontology. The current version of our system supports web content mining in English, Arabic, Russian, and Hebrew.
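The stages named in the abstract (labeled training set, ontology-driven features, induced classifier) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `ONTOLOGY` dictionary stands in for concepts that the real system would parse from the OWL file, the training documents are invented, and multinomial naive Bayes is a swapped-in choice, since the abstract does not name the classification algorithm.

```python
import math
from collections import Counter

# Hypothetical stand-in for concepts parsed from the OWL ontology:
# concept name -> key phrases that signal it (all entries are illustrative).
ONTOLOGY = {
    "production": ["production of", "manufacturing", "synthesis"],
    "chemical": ["ammonium nitrate", "sulfuric acid"],
    "equipment": ["reactor", "distillation"],
}

# Invented training set, labeled the way a human expert would label it.
TRAINING_SET = [
    ("Production of ammonium nitrate in a reactor.", "relevant"),
    ("Synthesis and manufacturing of sulfuric acid.", "relevant"),
    ("The nuclear reactor tour is open to visitors.", "non-relevant"),
    ("Distillation of whisky, a travel guide.", "non-relevant"),
]


def extract_features(text, ontology=ONTOLOGY):
    """Count how often each ontology concept's key phrases occur in the text."""
    lowered = text.lower()
    return Counter({
        concept: sum(lowered.count(phrase) for phrase in phrases)
        for concept, phrases in ontology.items()
    })


def train(labeled_docs, ontology=ONTOLOGY):
    """Fit a multinomial naive Bayes model over per-concept counts."""
    doc_counts = Counter()
    concept_counts = {}
    for text, label in labeled_docs:
        doc_counts[label] += 1
        features = extract_features(text, ontology)
        concept_counts.setdefault(label, Counter()).update(features)
    return {"docs": doc_counts, "concepts": concept_counts, "vocab": list(ontology)}


def classify(text, model, ontology=ONTOLOGY):
    """Return the label with the highest log-posterior for the given text."""
    features = extract_features(text, ontology)
    total_docs = sum(model["docs"].values())
    best_label, best_score = None, -math.inf
    for label, n_docs in model["docs"].items():
        counts = model["concepts"][label]
        # Laplace smoothing: add one pseudo-count per concept.
        denom = sum(counts.values()) + len(model["vocab"])
        score = math.log(n_docs / total_docs)
        for concept in model["vocab"]:
            score += features[concept] * math.log((counts[concept] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_SET)
print(classify("Industrial synthesis of sulfuric acid", model))
print(classify("Visit the old reactor museum", model))
```

Keeping the ontology as plain data means a second dictionary of key phrases in another language could be passed to the same functions, which loosely mirrors the language-independent design the abstract describes.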
Similar resources
Classification of Web Documents Using Concept Extraction from Ontologies
In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present the ontology-based web content mining methodology that contains such main stages as creation of ontology for the specified domain, collecting a training set of labeled documents, building a classification model in this domain using the constructed onto...
Using Multiple Related Ontologies in an Fuzzy Information Retrieval Model
As the Semantic Web progresses, many independently developed domain ontologies have to be shared and reused by a variety of applications. The use of ontologies in information retrieval applications allows the retrieval of documents that are semantically related to a user's initial query. This work presents a fuzzy information retrieval model for improving the document retrieval process consider...
Enriching Ontologies with Encyclopedic Background Knowledge for Document Indexing
The rapidly increasing number of scientific documents available publicly on the Internet creates the challenge of efficiently organizing and indexing these documents. Due to the time consuming and tedious nature of manual classification and indexing, there is a need for better methods to automate this process. This thesis proposes an approach which leverages encyclopedic background knowledge fo...
Multilingual Medical Documents Classification Based on MesH Domain Ontology
This article deals with the Semantic Web and ontologies. It addresses the classification of multilingual Web documents based on a domain ontology. The objective is to be able to classify documents in different languages using a single model. We try to solve this problem using two different approaches. Both approaches have two elementary stages: the creation of the model ...
Mapping Persian Words to WordNet Synsets
Lexical ontologies are one of the main resources for developing natural language processing and semantic web applications. Mapping lexical ontologies of different languages is very important for inter-lingual tasks. In addition, mapping approaches can be applied to build lexical ontologies for a new language based on pre-existing resources of other languages. In this paper we propose a sem...
Publication date: 2005